The Open Pediatric Cancer Project

Background: In 2019, the Open Pediatric Brain Tumor Atlas (OpenPBTA) was created as a global, collaborative open-science initiative to genomically characterize 1,074 pediatric brain tumors and 22 patient-derived cell lines. Here, we extend the OpenPBTA to create the Open Pediatric Cancer (OpenPedCan) Project, a harmonized open-source multi-omic dataset from 6,112 pediatric cancer patients with 7,096 tumor events across more than 100 histologies. Combined with RNA-Seq from the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA), OpenPedCan contains nearly 48,000 total biospecimens (24,002 tumor and 23,893 normal specimens). Findings: We utilized Gabriella Miller Kids First (GMKF) workflows to harmonize WGS, WXS, RNA-Seq, and Targeted Sequencing datasets to include somatic SNVs, InDels, CNVs, SVs, RNA expression, fusions, and splice variants. We integrated summarized CPTAC whole cell proteomics and phospho-proteomics data and miRNA-Seq data, and developed a methylation array harmonization workflow to include m-values, beta-values, and copy number calls. OpenPedCan contains reproducible, dockerized workflows in GitHub, CAVATICA, and Amazon Web Services (AWS) to deliver harmonized and processed data from over 60 scalable modules which can be leveraged both locally and on AWS. The processed data are released in a versioned manner and accessible through CAVATICA or AWS S3 download (from GitHub), and queryable through PedcBioPortal and the NCI’s pediatric Molecular Targets Platform. Notably, we have expanded PBTA molecular subtyping to include methylation information to align with the WHO 2021 Central Nervous System Tumor classifications, allowing us to create research-grade integrated diagnoses for these tumors. Conclusions: OpenPedCan data and its reproducible analysis module framework are openly available and can be utilized and/or adapted by researchers to accelerate discovery, validation, and clinical translation.

We harmonized, aggregated, and analyzed data from multiple pediatric and adult data sources, building upon the work of the OpenPBTA (Figure 1). Biospecimen-level metadata and clinical data are contained in Supplemental Table 1.

Figure 1: OpenPedCan Data. OpenPedCan contains multi-omic data from seven cohorts of pediatric tumors (A-B) with counts by tumor event, RNA-Seq from adult tumors from The Cancer Genome Atlas (TCGA) Program (C-D) and RNA-Seq from normal adult tissues from the Genotype-Tissue Expression (GTEx) project (E) with counts by specimen. (Abbreviations: TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PPTC = Pediatric Preclinical Testing Consortium, PBTA = Pediatric Brain Tumor Atlas, Maris = Neuroblastoma cell lines from the Maris Laboratory at CHOP, GMKF = Gabriella Miller Kids First, DGD = Division of Genomic Diagnostics at CHOP, CPTAC = Clinical Proteomic Tumor Analysis Consortium)

OpenPedCan currently includes the following datasets, described more fully below:

Open Pediatric Brain Tumor Atlas (OpenPBTA)
In September of 2018, the Children's Brain Tumor Network (CBTN) released the Pediatric Brain Tumor Atlas (PBTA), a genomic dataset (whole genome sequencing, whole exome sequencing, RNA sequencing, proteomic, and clinical data) for nearly 1,000 tumors, available from the Gabriella Miller Kids First Portal. In September of 2019, the Open Pediatric Brain Tumor Atlas (OpenPBTA) Project was launched. OpenPBTA was a global open science initiative to comprehensively define the molecular landscape of tumors of 943 patients from the CBTN and the PNOC003 DIPG clinical trial from the Pacific Pediatric Neuro-Oncology Consortium through real-time, collaborative analyses and collaborative manuscript writing on GitHub [1]. Additional PBTA data have been, and will continue to be, added to OpenPedCan.

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) Initiative is an NCI-funded collection of disease-specific projects that seeks to identify the genomic changes of pediatric cancers. The overall goal is to collect genomic data to accelerate the development of more effective therapies. OpenPedCan analyses include the seven diseases present in the TARGET dataset: Acute Lymphoblastic Leukemia (ALL), Acute Myeloid Leukemia (AML), Clear cell sarcoma of the kidney, Neuroblastoma, Osteosarcoma, Rhabdoid tumor, and Wilms tumor.

Gabriella Miller Kids First (Neuroblastoma) and PBTA
The Gabriella Miller Kids First Pediatric Research Program (Kids First) is a large-scale effort to accelerate research and gene discovery in pediatric cancers and structural birth defects. The program includes whole genome sequencing (WGS) from patients with pediatric cancers and structural birth defects and their families. OpenPedCan analyses include Neuroblastoma and PBTA data from the Kids First projects.

Chordoma Foundation
The Chordoma Foundation seeks to advance research and improve healthcare for patients diagnosed with chordoma and has shared patient and model sequencing data with the CBTN.

Pediatric Preclinical Testing Consortium (PPTC)
The National Cancer Institute's (NCI) former PPTC, now the Pediatric Preclinical in Vivo Testing (PIVOT) Program, molecularly and pharmacologically characterizes cell-derived and patient-derived xenograft (PDX) models. OpenPedCan includes re-harmonized RNA-Seq data for 244 models from the initial PPTC study [2].

MI-ONCOSEQ Study [3]
These clinical sequencing data from the University of Michigan were donated to CBTN and added to the PBTA cohort.

Division of Genomic Diagnostics at Children's Hospital of Philadelphia (DGD)
CHOP's Division of Genomic Diagnostics has partnered with CCDI to add somatic panel sequencing data to OpenPedCan and the Molecular Targets Platform.

The Genotype-Tissue Expression Project (GTEx)
The GTEx project is an ongoing effort to build a comprehensive public data resource and tissue bank to study tissue-specific gene expression, regulation and their relationship with genetic variants.

Clinical Proteomic Tumor Analysis Consortium (CPTAC) PBTA proteomics study
The CPTAC pediatric pan-brain tumor study [4] contains 218 tumors profiled by proteogenomics, which are included in OpenPedCan.

CPTAC adult GBM proteomics study
This CPTAC adult GBM study [5] contains 99 tumors profiled by proteogenomics, which are included in OpenPedCan.

Project HOPE proteomics study
Project HOPE is an adolescent and young adult high-grade glioma study (in preparation for publication) that contains 90 tumors profiled by proteogenomics, which are included in OpenPedCan.

Context
Creation of this dataset had multiple motivations. First, we sought to harmonize, summarize, and contextualize pediatric cancer genomics data among normal tissues (GTEx) and adult cancer tissues (TCGA) to enable the creation of the National Cancer Institute's Molecular Targets Platform (MTP) at https://moleculartargets.ccdi.cancer.gov/. Next, we created this resource for broad community use to promote rapid reuse and accelerate the discovery of additional mechanisms contributing to the pathogenesis of pediatric cancers and/or to identify novel candidate therapeutic targets for pediatric cancer.
Similar to OpenPBTA, OpenPedCan operates on a pull request model to accept contributions. We set up continuous integration software via GitHub Actions to confirm the reproducibility of analyses within the project's Docker container. We maintained a data release folder on Amazon S3, downloadable directly from S3 or our open-access CAVATICA project, with merged files for each analysis. As we produced new results, identified data issues, or added additional data, we created new data releases in a versioned manner. The project maintainers include scientists from the Center for Data-Driven Discovery in Biomedicine and formerly the Department of Biomedical and Health Informatics at the Children's Hospital of Philadelphia.

Methods
An overview of the OpenPedCan methods is depicted in Figure 2. Briefly, most primary harmonization analysis workflows were performed with Kids First pipelines written in Common Workflow Language (CWL) using CAVATICA (detailed below). Alignment and expression quantification for GTEx and TCGA RNA-Seq was performed by the respective consortium. Custom Python, R, and/or Bash scripts were then created in OpenPedCan using the primary harmonized output files. Files derived from the primary analysis workflows (green) are released within OpenPedCan. Additional analysis modules developed within OpenPedCan (red) also generate results files (green) which are released within OpenPedCan.

Method Details
Nucleic acids extraction and library preparation (PBTA X01 and miRNA-Seq)
For detailed methods about the OpenPBTA cohort, please refer to the manuscript [1]. For the PBTA X01 cohort, libraries were prepped using the Illumina TruSeq Strand-Specific Protocol to pull out poly-adenylated transcripts.

cDNA Library Construction
Total RNA was quantified using the Quant-iT™ RiboGreen® RNA Assay Kit and normalized to 5 ng/uL. Following plating, 2 uL of ERCC controls (using a 1:1000 dilution) were spiked into each sample. An aliquot of 325 ng for each sample was transferred into library preparation. The resultant 400 bp cDNA went through dual-indexed library preparation: 'A' base addition, adapter ligation using P7 adapters, and PCR enrichment using P5 adapters. After enrichment, the libraries were quantified using Quant-iT PicoGreen (1:200 dilution). Samples were normalized to 5 ng/uL. The sample set was pooled and quantified using the KAPA Library Quantification Kit for Illumina Sequencing Platforms.

miRNA Extraction and Library Preparation
Total RNA for CBTN samples was extracted as described in OpenPBTA [1] and prepared according to the HTG EdgeSeq protocol for the extracted RNA miRNA Whole Transcriptome Assay (WTA). 15 ng of RNA was mixed in 25 uL of lysis buffer and loaded onto a 96-well plate. Human Fetal Brain Total RNA (Takara Bio USA, #636526) and Human Brain Total RNA (Ambion, Inc., Austin, TX, USA) were used as controls. The plate was loaded into the HTG EdgeSeq processor along with the miRNA WTA assay reagent pack. Samples were processed for 18-20 hours, then were barcoded and amplified using a unique forward and reverse primer combination. PCR settings used for barcoding and amplification were 95°C for 4 min, 16 cycles of (95°C for 15 sec, 56°C for 45 sec, 68°C for 45 sec), and 68°C for 10 min. Barcoded and amplified samples were cleaned using AMPure magnetic beads (AMPure XP, Cat# A63881). Libraries were quantified using the KAPA Biosystems qPCR assay kit (Kapa Biosystems Cat# KK4824) and CT values were used to determine the pM concentration of each library.

Data generation
PBTA X01 Illumina Sequencing
Pooled libraries were normalized to 2 nM and denatured using 0.1 N NaOH prior to sequencing. Flowcell cluster amplification and sequencing were performed according to the manufacturer's protocols using the NovaSeq 6000. Each run was a 151 bp paired-end run with an eight-base index barcode read. Data were analyzed using the Broad Picard Pipeline, which includes de-multiplexing and data aggregation.

PBTA miRNA Sequencing
Libraries were pooled, denatured, and loaded onto a sequencing cartridge. Libraries were sequenced using an Illumina NextSeq 500 per manufacturer guidelines. FASTQ files were generated from raw sequencing data using Illumina BaseSpace and analyzed with the HTG EdgeSeq Parser software v5.4.0.7543 to generate an Excel file containing quantification of 2083 miRNAs per sample. Any sample that did not pass the quality control set by the HTG REVEAL software version 2.0.1 (Tucson, AZ, USA) was excluded from the analysis.

DNA WGS Alignment and SNP Calling
Please refer to the OpenPBTA manuscript for details on DNA WGS Alignment, prediction of participants' genetic sex, and SNP calling for B-allele Frequency (BAF) generation [1].

Somatic Mutation and INDEL Calling
For matched tumor/normal samples, we used the same mutation calling methods as described in the OpenPBTA manuscript [1]. For tumor only samples, we ran Mutect2 from GATK v4.2.2.0 using the following workflow.
Strelka2 outputs multi-nucleotide polymorphisms (MNPs) as consecutive single-nucleotide polymorphisms. In order to preserve MNPs, we gather MNP calls from the other caller inputs, and search for evidence supporting these consecutive SNP calls as MNP candidates. Once found, the Strelka2 SNP calls supporting an MNP are converted to a single MNP call. This is done to preserve the predicted gene model as accurately as possible in our consensus calls.
Consensus SNVs from all four callers were collected and, by default, calls that were detected by at least two calling algorithms or marked with "HotSpotAllele" were retained.
For all SNVs, potential non-hotspot germline variants were removed if they had a normal depth <= 7 and gnomAD allele frequency > 0.001. Final results were saved in MAF format.
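The retention and germline-removal rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Kids First consensus workflow itself: the dictionary keys (`n_callers`, `hotspot`, `normal_depth`, `gnomad_af`) are hypothetical names, and treating a missing gnomAD frequency as "not germline" is an assumption made for the example.

```python
from typing import Optional

MIN_CALLERS = 2        # retained if detected by at least two callers
MAX_NORMAL_DEPTH = 7   # germline filter: normal depth <= 7 ...
MIN_GNOMAD_AF = 0.001  # ... and gnomAD allele frequency > 0.001


def keep_in_consensus(n_callers: int, is_hotspot: bool) -> bool:
    """Retain calls seen by >= 2 callers or marked as a hotspot allele."""
    return is_hotspot or n_callers >= MIN_CALLERS


def is_potential_germline(normal_depth: int,
                          gnomad_af: Optional[float],
                          is_hotspot: bool) -> bool:
    """Flag non-hotspot calls with low normal depth and a common gnomAD allele."""
    if is_hotspot or gnomad_af is None:
        # Assumption: hotspots are never removed; missing AF is not flagged.
        return False
    return normal_depth <= MAX_NORMAL_DEPTH and gnomad_af > MIN_GNOMAD_AF


def filter_calls(calls):
    """Apply both rules to a list of dict-like variant records."""
    return [
        c for c in calls
        if keep_in_consensus(c["n_callers"], c["hotspot"])
        and not is_potential_germline(c["normal_depth"], c["gnomad_af"], c["hotspot"])
    ]
```

Keeping the two rules as separate predicates makes it easy to change a threshold or report how many calls each rule removed.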

Somatic Copy Number Variant (CNV) Calling
We called copy number variants for tumor/normal samples using Control-FREEC [11,12] and CNVkit [13] as described in the OpenPBTA manuscript [1]. We used GATK [14] to call CNVs for matched tumor/normal WGS samples when there were at least 30 male and 30 female normals from the same sequencing platform available for panel of normals creation. For tumor only samples, we used Control-FREEC with the following modifications. Instead of the b-allele frequency germline input file, we used the dbSNP_v153_ucsccompatible.converted.vt.decomp.norm.common_snps.vcf.gz dbSNP common SNPs file and, to avoid hard-to-call regions, utilized the hg38_canonical_150.mappability mappability file. Both are also linked in the public Kids First references CAVATICA project. The Control-FREEC tumor only workflow can be found here.

Somatic Structural Variant Calling (WGS samples only)
Please refer to the OpenPBTA manuscript for details [1].

Methylation Analysis

Methylation array preprocessing
We preprocessed raw Illumina 450K and EPIC 850K Infinium Human Methylation Bead Array intensities using the array preprocessing methods implemented in the minfi Bioconductor package [15]. To estimate usable methylation measurements (beta-values and m-values) and copy number (cn-values), we utilized preprocessFunnorm when an array dataset had both tumor and normal samples or multiple OpenPedCan-defined cancer_groups, and preprocessQuantile when an array dataset had only tumor samples from a single OpenPedCan-defined cancer_group. Some Illumina Infinium array probes targeting CpG loci contain single-nucleotide polymorphisms (SNPs) near or within the probe [16], which could affect DNA methylation measurements [17]. As the minfi preprocessing workflow recommends, we dropped probes containing common SNPs in dbSNP (minor allele frequency > 1%) at the CpG interrogation or the single nucleotide extensions.
Details of methylation array preprocessing are available in the OpenPedCan methylation-preprocessing module.
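As an illustration of the selection rule above, the following Python sketch picks the minfi normalization function name from the composition of an array dataset and converts a beta-value to an m-value using the standard logit relationship M = log2(beta / (1 - beta)). The function names are hypothetical, the actual preprocessing is done in R with minfi, and the behavior for datasets not covered by the text (for example, normal-only datasets) is an assumption.

```python
import math


def choose_preprocessing(has_tumor: bool, has_normal: bool, n_cancer_groups: int) -> str:
    """Return the name of the minfi function implied by the dataset composition."""
    if (has_tumor and has_normal) or n_cancer_groups > 1:
        return "preprocessFunnorm"
    # Tumor-only dataset from a single cancer_group. Other cases are not
    # described in the text and fall through here by assumption.
    return "preprocessQuantile"


def beta_to_m(beta: float) -> float:
    """Convert a beta-value in (0, 1) to an m-value."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))
```

For example, `choose_preprocessing(True, False, 1)` returns `"preprocessQuantile"`, and `beta_to_m(0.5)` returns `0.0`.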

Methylation classification of brain tumor molecular subtypes
The

Gene Expression
The tumor-normal-differential-expression module performs differential expression analyses for all sets of Disease (cancer_group) and Dataset (cohort) across all genes found in the gene-expression-rsem-tpm-collapsed.rds table. The purpose of this analysis is to highlight the correlation and understand the variability in gene expression in different cancer conditions across different histological tissues. For OpenPedCan v12 data release, this module performs expression analysis over 102 cancer groups across 52 histological tissues for all 54,346 genes found in the dataset. This analysis was performed on the Children's Hospital of Philadelphia HPC and was configured to use 96G of RAM per CPU, with one task (one iteration of expression analysis for each set of tissue and cancer group) per CPU (total 102x52=5304 CPUs) using the R/DESeq2 package. Please refer to script run-tumor-normal-differential-expression.sh in the module for additional details on Slurm processing configuration. The same analysis can also be performed on CAVATICA, but requires further optimization. The module describes the steps for CAVATICA set up, and scripts to publish an application on the portal. The required data files are also available publicly on CAVATICA under the Open Pediatric Cancer (OpenPedCan) Open Access. Refer to the module for detailed description and scripts.

Abundance Estimation
Among the data sources used for OpenPedCan, GTEx and TCGA used GENCODE v26 and v36, respectively. Therefore, the gene symbols had to be harmonized to GENCODE v39 for compatibility with the rest of the dataset. The liftover process was done via a custom script. The script first constructs an object detailing the gene symbol changes from the HGNC symbol database. Using the symbol-change object, the script updates any columns containing gene symbols. This liftover process was used on GTEx RNA-Seq, TCGA RNA-Seq, DGD fusions, and DNA hotspot files.
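The two-step liftover described above (build a symbol-change object, then update symbol columns) might look like the following Python sketch. It is not the project's custom script: the record layout (an approved symbol plus a list of previous symbols), the column names, and the decision to leave ambiguous or currently approved symbols unchanged are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_symbol_map(hgnc_records: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map previous gene symbols to their current approved symbol."""
    approved = {symbol for symbol, _ in hgnc_records}
    mapping: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for approved_symbol, previous_symbols in hgnc_records:
        for old in previous_symbols:
            if old in approved:
                continue  # still an approved symbol for some gene: leave as-is
            if old in mapping and mapping[old] != approved_symbol:
                ambiguous.add(old)  # old symbol points to more than one gene
            mapping[old] = approved_symbol
    for old in ambiguous:
        mapping.pop(old, None)
    return mapping


def liftover_symbols(rows: Iterable[dict], symbol_columns: Set[str],
                     mapping: Dict[str, str]) -> List[dict]:
    """Update only the columns that hold gene symbols; other columns are untouched."""
    return [
        {key: (mapping.get(value, value) if key in symbol_columns else value)
         for key, value in row.items()}
        for row in rows
    ]
```

Restricting the update to named symbol columns avoids accidentally rewriting free-text fields that happen to contain a string matching an old symbol.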
Additionally, the gene expression matrices had some instances where multiple Ensembl gene identifiers mapped to the same gene symbol. This was dealt with by filtering the expression matrix to only genes with [FPKM/TPM] > 0 and then selecting the instance of the gene symbol with the maximum mean [FPKM/TPM/Expected_count] value across samples. This enabled many downstream modules that require RNA-Seq data to have gene symbols as unique gene identifiers. Refer to the collapse-rnaseq module for scripts and details.
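A minimal Python sketch of this collapsing step is shown below. It is not the collapse-rnaseq module: the input layout is hypothetical, the "> 0" filter is interpreted here as a mean expression above zero across samples, and ties are resolved by keeping the first identifier encountered, both of which are assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


def collapse_by_symbol(
    expression: Dict[str, Tuple[str, List[float]]]
) -> Dict[str, List[float]]:
    """Collapse Ensembl-level rows to one row per gene symbol.

    `expression` maps an Ensembl gene ID to (gene_symbol, values across samples).
    """
    best: Dict[str, Tuple[float, List[float]]] = {}
    for _ensembl_id, (symbol, values) in expression.items():
        average = mean(values)
        if average <= 0:
            continue  # drop genes with no expression
        if symbol not in best or average > best[symbol][0]:
            best[symbol] = (average, values)  # keep the row with the highest mean
    return {symbol: values for symbol, (_, values) in best.items()}
```

After collapsing, each gene symbol appears exactly once, so it can be used as a unique row identifier downstream.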

Gene fusion detection from RNA-Seq
Gene fusions were called using Arriba [19] and STAR-Fusion [20] as previously reported in OpenPBTA [1]. We updated the annoFuseData R package to liftover gene symbols to be concordant with VEP v. 105. Fusions are now filtered with annoFuse [21] upstream and released in fusion-annoFuse.tsv.gz.

Gene fusion detection from fusion panels (DGD only)
Clinical RNA fusion calls from the CHOP DGD fusion panel are included in the data release in the fusion-dgd.tsv.gz file.

Splicing quantification
To detect alternative splicing events, we utilized rMATS turbo (v.4.1.0) with Ensembl/GENCODE v39 GFF annotations using the Kids First RNA-Seq workflow. We used the --variable-read-length and -t paired options and applied an additional filter to include only splicing events with total junction read counts greater than 10.
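The junction-count filter can be sketched as follows. This is an assumption-laden illustration rather than the workflow's code: it assumes each event carries the rMATS inclusion and skipping junction count columns (`IJC_SAMPLE_1` and `SJC_SAMPLE_1`, comma-separated when there are replicates) and that "total junction read counts" means their sum.

```python
from typing import Dict, List


def total_junction_reads(ijc: str, sjc: str) -> int:
    """Sum inclusion and skipping junction counts across comma-separated replicates."""
    counts = [
        int(token)
        for field in (ijc, sjc)
        for token in field.split(",")
        if token not in ("", "NA")
    ]
    return sum(counts)


def filter_events(events: List[Dict[str, str]], min_total: int = 10) -> List[Dict[str, str]]:
    """Keep splicing events whose total junction read count exceeds min_total."""
    return [
        event for event in events
        if total_junction_reads(event["IJC_SAMPLE_1"], event["SJC_SAMPLE_1"]) > min_total
    ]
```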

CPTAC PBTA, CPTAC GBM, and HOPE proteogenomics
The following methods are the general proteomics approaches used for the CPTAC PBTA [4], CPTAC GBM [5], and HOPE (pre-publication, correspondence with Dr. Pei Wang) studies. For specific descriptions of sample preparation, mass spectrometry instrumentation and approaches, and data generation, processing, or analysis please refer to the relevant publications.

TMT-11 Labeling and Phosphopeptide Enrichment
Proteome and phosphoproteome analysis of brain cancer samples in the CPTAC PBTA (pediatric), CPTAC GBM (adult), and HOPE (adolescent and young adult, AYA) cohort studies were structured as TMT11-plex experiments. Tumor samples were digested with LysC and trypsin. Digested peptides were labeled with TMT11-plex reagent and prepared for phosphopeptide enrichment. For each dataset, a common reference sample was compiled from representative samples within the cohort. Phosphopeptides were enriched using Immobilized Metal Affinity Chromatography (IMAC) with Fe3+-NTA-agarose bead kits.

Liquid Chromatography with Tandem Mass Spectrometry (LC-MS/MS) Analysis
To reduce sample complexity, peptide samples were separated by high pH reversed phase HPLC fractionation. For CPTAC PBTA, a total of 96 fractions were consolidated into 12 final fractions for LC-MS/MS analysis. For CPTAC GBM and HOPE cohorts, a total of 96 fractions were consolidated into 24 fractions. For CPTAC PBTA, global proteome mass spectrometry analyses were performed on an Orbitrap Fusion Tribrid Mass Spectrometer and phosphoproteome analyses were performed on an Orbitrap Fusion Lumos Tribrid Mass Spectrometer. For CPTAC GBM and HOPE studies, mass spectrometry analysis was performed using an Orbitrap Fusion Lumos Mass Spectrometer.

Protein Identification
The CPTAC PBTA spectra data were analyzed with MSFragger version 20190628 [22] searching against a CPTAC harmonized RefSeq-based sequence database containing 41,457 proteins mapped to the human reference genome (GRCh38/hg38) obtained via the UCSC

Protein Quantification and Data Analysis
Relative protein (gene) abundance was calculated as the ratio of sample abundance to reference abundance using the summed reporter ion intensities from peptides mapped to the respective gene. For phosphoproteomic datasets, data were not summarized by protein but left at the phosphopeptide level. Global normalization was performed on the gene-level abundance matrix (log2 ratio) for global proteomic data and on the site-level abundance matrix (log2 ratio) for phosphoproteomic data. The median log2 relative protein or peptide abundance for each sample was calculated and used to normalize each sample to achieve a common median of 0.
To identify TMT outliers, inter-TMT t-tests were performed for each individual protein or phosphopeptide. Batch effects were checked using the log2 relative protein or phosphopeptide abundance and corrected using the ComBat algorithm [26]. Imputation was performed after batch effect correction for proteins or phosphopeptides with a missing rate < 50%. For the phosphopeptide datasets, 440 markers associated with cold-regulated ischemia genes were filtered and removed.
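The per-sample median normalization described above amounts to subtracting each sample's median log2 ratio so that every sample ends up with a median of 0. A minimal Python sketch follows. The input layout (sample name mapped to a list of log2 ratios, with `None` for missing values) is hypothetical, and the sketch assumes every sample has at least one observed value.

```python
from statistics import median
from typing import Dict, List, Optional


def median_center(
    log2_ratios: Dict[str, List[Optional[float]]]
) -> Dict[str, List[Optional[float]]]:
    """Shift each sample's log2 ratios so that the sample median becomes 0."""
    centered = {}
    for sample, values in log2_ratios.items():
        observed = [v for v in values if v is not None]
        sample_median = median(observed)  # missing values are ignored
        centered[sample] = [None if v is None else v - sample_median for v in values]
    return centered
```

Missing values are carried through unchanged so that later steps, such as imputation for features with a missing rate below 50%, can still identify them.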

Creation of OpenPedCan Analysis modules
Gene Set Variation Analysis (gene-set-enrichment-analysis analysis module)
Please refer to the OpenPBTA manuscript for details [1].

Fusion prioritization (fusion_filtering analysis module)
The fusion_filtering module filters artifacts and annotates fusion calls, with prioritization for oncogenic fusions, for the fusion calls from STAR-Fusion and Arriba. After artifact filtering, fusions were prioritized and annotated as "putative oncogenic fusions" when at least one gene was a known kinase, oncogene, tumor suppressor, curated transcription factor, on the COSMIC Cancer Gene Census List, or observed in TCGA. Fusions were retained in this module if they were called by both callers, recurrent or specific to a cancer group, or annotated as a putative oncogenic fusion. Please refer to the module linked above for more detailed documentation and scripts.

Consensus CNV Calling (WGS samples only) (copy_number_consensus_call* analysis modules)
We adopted the consensus CNV calling described in the OpenPBTA manuscript [1] with minor adjustments. For each caller and sample with WGS performed, we called CNVs based on consensus among Control-FREEC [11,12], CNVkit [13], and GATK [14]. Sample and consensus caller files with more than 2,500 CNVs were removed to de-noise and increase data quality, based on cutoffs used in GISTIC [27]. For each sample, we included the following regions in the final consensus set: 1) regions with reciprocal overlap of 50% or more between at least two of the callers; 2) smaller CNV regions in which more than 90% of regions were covered by another caller. For GATK, if a panel of normals could not be created (this required 30 male and 30 female normals from the same sequencing platform), consensus was run for that tumor using Control-FREEC, CNVkit, and MantaSV. We defined copy number as NA for any regions that had a neutral call for the samples included in the consensus file. We merged CNV regions within 10,000 bp of each other with the same direction of gain or loss into a single region.
Any CNVs that overlapped 50% or more with immunoglobulin, telomeric, centromeric, or segmental duplication regions, or that were shorter than 3000 bp, were filtered out. The CNVkit calls for WXS samples were appended to the consensus CNV file.
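Two of the geometric rules above, the 50% reciprocal overlap test and the merging of same-direction segments within 10,000 bp, can be illustrated with a short Python sketch. It is not the copy_number_consensus_call code: segments are assumed to be half-open (start, end) intervals on a single chromosome, and the tuple layout is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (start, end, direction), e.g. direction = "gain" or "loss"


def overlap_bp(a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def reciprocal_overlap(a: Tuple[int, int], b: Tuple[int, int],
                       min_fraction: float = 0.5) -> bool:
    """True if the overlap covers at least min_fraction of BOTH intervals."""
    shared = overlap_bp(a, b)
    return (shared >= min_fraction * (a[1] - a[0])
            and shared >= min_fraction * (b[1] - b[0]))


def merge_nearby(segments: List[Segment], max_gap: int = 10_000) -> List[Segment]:
    """Merge segments with the same direction that lie within max_gap bp of each other."""
    merged: List[Segment] = []
    for direction in sorted({d for _, _, d in segments}):
        current = None
        for start, end, _ in sorted(s for s in segments if s[2] == direction):
            if current is not None and start - current[1] <= max_gap:
                current = (current[0], max(current[1], end), direction)
            else:
                if current is not None:
                    merged.append(current)
                current = (start, end, direction)
        if current is not None:
            merged.append(current)
    return sorted(merged)
```

Requiring the overlap fraction on both intervals is what makes the test reciprocal: a small segment nested inside a very large one passes a one-sided test but fails this one.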

Focal Copy Number Calling (focal-cn-file-preparation analysis module)
Please refer to the OpenPBTA manuscript for details on assignment of copy number status values to CNV segments, cytobands, and genes [1]. We applied criteria to resolve instances of multiple conflicting status calls for the same gene and sample, which are described in detail in the focal-cn-file-preparation module. Briefly, we prioritized 1) non-neutral status calls, 2) calls made from dominant segments with respect to gene overlap, and 3) amplification and deep deletion status calls over gain and loss calls, respectively, when selecting a dominant status call per gene and sample. These methods resolved >99% of duplicated gene-level status calls.

Mutational Signatures (mutational-signatures analysis module)
We obtained mutational signature weights (i.e., exposures) from consensus SNVs using the deconstructSigs R package [28]. We estimated weights for single- and double-base substitution (SBS and DBS, respectively) signatures from the Catalogue of Somatic Mutations in Cancer (COSMIC) database versions 2 and 3.3, as well as SBS signatures from Alexandrov et al. 2013 [29]. The following COSMIC SBS signatures were excluded from weight estimation in all tumors: 1) sequencing artifact signatures, 2) signatures associated with environmental exposure, and 3) signatures with an unknown etiology. Additionally, we excluded therapy-associated signatures from mutational signature weight estimation in tumors collected prior to treatment (i.e., "Initial CNS Tumor" or "Primary Tumor").

Tumor Mutation Burden [TMB] (tmb-calculation analysis module)
Recent clinical studies have associated high TMB with improved patient response rates and survival benefit from immune checkpoint inhibitors [30].
The Tumor Mutation Burden (TMB) tmb-calculation module was adapted from the snv-callers module of the OpenPBTA project [1]. Here, we use mutations in the snv-consensus-plus-hotspots.maf.tsv.gz file, which is generated using the Kids First DRC Consensus Calling Workflow and is included in the OpenPedCan data download. The consensus MAF contains SNVs or MNVs called in at least 2 of the 4 callers (Mutect2, Strelka2, Lancet, and Vardict) plus hotspot mutations if called in 1 of the 4 callers. We calculated TMB for tumor samples sequenced with either WGS or WXS. Briefly, we split the SNV consensus MAF into SNVs and multinucleotide variants (MNVs). We split the MNV subset into SNV calls, merged those back with the SNVs subset, and then removed sample-specific redundant calls.
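The MNV handling described above (split each MNV into single-base calls, merge with the SNVs, drop redundant calls within a sample) can be sketched in Python as follows. This is an illustrative simplification, not the tmb-calculation module: variants are represented as hypothetical (chromosome, position, ref, alt) tuples for one sample, and only positions where the reference and alternate bases differ are emitted, which is an assumption.

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def split_mnv(chrom: str, pos: int, ref: str, alt: str) -> List[Variant]:
    """Split an equal-length multinucleotide variant into single-base calls."""
    if len(ref) != len(alt):
        raise ValueError("MNVs must have reference and alternate alleles of equal length")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


def unique_snvs(variants: Iterable[Variant]) -> Set[Variant]:
    """Merge SNVs and split MNVs for one sample, removing redundant calls."""
    calls: Set[Variant] = set()
    for chrom, pos, ref, alt in variants:
        if len(ref) == len(alt):  # SNV (length 1) or MNV (length > 1)
            calls.update(split_mnv(chrom, pos, ref, alt))
        # Variants of unequal length (indels) are ignored in this sketch.
    return calls
```

Using a set per sample removes the case where an SNV caller and a split MNV report the same base change at the same position.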

All mutation TMB
For WGS samples, we calculated the size of the genome covered as the intersection of Strelka2 and Mutect2's effectively surveyed areas, regions common to all variant callers, and used this as the denominator.

WGS_all_mutations_TMB = (total # mutations in consensus MAF) / intersection_strelka_mutect_vardict_genome_size

For WXS samples, we used the size of the WXS bed region file as the denominator.

WXS_all_mutations_TMB = (total # mutations in consensus MAF) / wxs_genome_size

Coding only TMB
We generated coding only TMB from the consensus MAF as well. We calculated the intersection for Strelka2 and Mutect2 surveyed regions using the coding sequence ranges in the GENCODE v39 gtf supplied in the OpenPedCan data download. We removed SNVs outside of these coding sequences prior to implementing the TMB calculation below:

WGS_coding_only_TMB = (total # coding mutations in consensus MAF) / intersection_wgs_strelka_mutect_vardict_CDS_genome_size

For WXS samples, we intersected each WXS bed region file with the GENCODE v39 coding sequence, summed only variants within this region for the numerator, and calculated the size of this region as the denominator.

WXS_coding_only_TMB = (total # coding mutations in consensus MAF) / intersection_wxs_CDS_genome_size

Finally, we include an option (nonsynfilter_focr) to use specific nonsynonymous mutation variant classifications recommended from the TMB Harmonization Project.
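The TMB formulas above share the same shape: a mutation count divided by the size of a surveyed region, where that region is often an intersection of interval sets. The Python sketch below illustrates both pieces under stated assumptions: intervals are sorted, non-overlapping, half-open, and on a single chromosome. The `per_mb` option is a convenience added for the example and is not part of the formulas as written, which divide by the region size directly.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end)


def intersection_size(a: List[Interval], b: List[Interval]) -> int:
    """Total bases shared by two sorted, non-overlapping interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if high > low:
            total += high - low
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def tumor_mutation_burden(n_mutations: int, region_size_bp: int,
                          per_mb: bool = False) -> float:
    """Mutations divided by surveyed region size, optionally scaled to per megabase."""
    if region_size_bp <= 0:
        raise ValueError("region size must be positive")
    if per_mb:
        return n_mutations * 1_000_000 / region_size_bp
    return n_mutations / region_size_bp
```

For example, with surveyed regions `[(0, 100), (200, 300)]` and `[(50, 250)]`, `intersection_size` returns 100, which would serve as the denominator for that toy sample.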

Molecular Subtyping
Here, we build upon the molecular subtyping performed in OpenPBTA [1] to align with WHO 2021 subtypes [31]. Molecular subtypes were generated per tumor event and are listed for each biospecimen in Supplemental Table S1, with the number of tumors grouped by broad histology and molecular subtype in Supplemental Table S2.

Neuroblastoma tumors
Neuroblastoma (NBL) tumors with a pathology diagnosis of neuroblastoma, ganglioneuroblastoma, or ganglioneuroma were subtyped based on their MYCN copy number status as either "NBL, MYCN amplified" or "NBL, MYCN non-amplified". If pathology_free_text_diagnosis was "NBL, MYCN non-amplified" and the genetic data suggested MYCN amplification, the samples were subtyped as "NBL, MYCN amplified". On the other hand, if pathology_free_text_diagnosis was "NBL, MYCN amplified" and the genetic data suggested MYCN non-amplification, the RNA-Seq gene expression level of MYCN was used as a prediction indicator. In those cases, samples with MYCN gene expression above or below the cutoff (TPM >= 140.83 based on visual inspection of MYCN CNV status) were subtyped as "NBL, MYCN amplified" and "NBL, MYCN non-amplified", respectively. MYCN gene expression was also used to subtype samples without DNA sequencing data. If a sample did not fit any of these situations, it was denoted as "NBL, To be classified".
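The decision rules above can be written as a small Python function. This is a sketch of the logic as described in the text, not the project's subtyping code: the argument names are hypothetical, `dna_amplified=None` stands for "no DNA sequencing data", and returning "NBL, To be classified" when expression is needed but unavailable is an assumption.

```python
from typing import Optional

MYCN_TPM_CUTOFF = 140.83
AMPLIFIED = "NBL, MYCN amplified"
NON_AMPLIFIED = "NBL, MYCN non-amplified"
TO_BE_CLASSIFIED = "NBL, To be classified"


def subtype_from_expression(mycn_tpm: Optional[float]) -> str:
    """Use MYCN expression (TPM) as the predictor when DNA evidence cannot decide."""
    if mycn_tpm is None:
        return TO_BE_CLASSIFIED
    return AMPLIFIED if mycn_tpm >= MYCN_TPM_CUTOFF else NON_AMPLIFIED


def subtype_nbl(pathology: Optional[str], dna_amplified: Optional[bool],
                mycn_tpm: Optional[float]) -> str:
    """Assign an NBL subtype from pathology text, MYCN copy number status, and expression."""
    if dna_amplified is None:
        return subtype_from_expression(mycn_tpm)  # no DNA sequencing data
    if dna_amplified:
        return AMPLIFIED  # DNA evidence of amplification overrides pathology
    if pathology == AMPLIFIED:
        return subtype_from_expression(mycn_tpm)  # pathology and DNA disagree
    return NON_AMPLIFIED
```

For example, `subtype_nbl("NBL, MYCN amplified", False, 200.0)` returns "NBL, MYCN amplified" because the conflict is resolved by expression above the cutoff.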

TP53 Alteration Annotation (tp53_nf1_score analysis module)
Please refer to the OpenPBTA manuscript for details [1].

Selection of independent samples (independent-samples analysis module)
For analyses that require all input biospecimens to be independent, we use the OpenPedCan-analysis independent-samples module to select only one biospecimen from each input participant. For each input participant of an analysis, the independent biospecimen is selected based on the analysis-specific filters and preferences for the biospecimen metadata, such as experimental strategy, cancer group, and tumor descriptor.

Data Validation and Quality Control
We ran NGSCheckMate [35] to confirm tumor/normal sample matches as described in the OpenPBTA manuscript [1] and excluded mismatched samples. We also ran somalier relate [36] to identify potential mismatched samples. We required at least 20M total reads, with at least 50% of reads mapped to the human reference, for RNA-Seq samples to be included in analysis. We required at least 20X coverage for tumor DNA samples to be included in this analysis.

Re-use potential
OpenPedCan serves as a community resource whose outputs and/or code can be leveraged directly to ask research questions or serve as an orthogonal validation dataset. We encourage re-use of the data, ideas and suggestions for improving the data or adding analyses, and/or direct code contributions through a pull request. Further, the analysis modules can be run within the project Docker container locally or on EC2 and scaled as the data size increases.
Software versions are documented in Supplemental Table 3.

Data Availability

Datasets
The datasets supporting this study are available as follows: The TARGET dataset is available in dbGAP under phs000218.v23.p8 [37].

Figure 2: OpenPedCan Analysis Workflow. Depicted are the datasets (yellow, orange, and grey) contained within OpenPedCan. These datasets are made available in a harmonized manner through primary analysis workflows (blue) for DNA, RNA, and/or proteogenomics data. Files derived from the primary analysis workflows (green) are released within OpenPedCan. Additional analysis modules developed within OpenPedCan (red) also generate results files (green) which are released within OpenPedCan.

Figure 3: Medulloblastoma Sample Clustering. A, UMAP projection of 271 MB tumors and B, 63 SHH-activated MB tumors using methylation beta values of the 20,000 most variable probes from the Infinium MethylationEPIC array. C, UMAP projection of MB, SHH activated samples indicating copy number status of SHH subgroup known somatic driver genes CCND2, GLI2, MYCN, and PTEN.

The Cancer Genome Atlas Program (TCGA)
The Clinical Methylation Unit Laboratory of Pathology at the National Cancer Institute Center for Cancer Research ran the DKFZ brain classifier version 12.6, a comprehensive DNA methylation-based classification of CNS tumors across all entities and age groups [18], and/or the Bethesda Brain tumor classifier v2.0 (NIH_v2) and the combo reporter pipeline v2.0 on docker container trust1/bethesda:latest. Unprocessed IDAT files from the Children's Brain Tumor Network (CBTN) Infinium Human Methylation EPIC (850k) BeadChip arrays were used as input and the following information was compiled into the histologies.tsv file: dkfz_v12_methylation_subclass (predicted methylation subtype), dkfz_v12_methylation_subclass_score (classification score),